library(climatic4economist)1: Extract spatial data based on spatial point
1 Introduction
This tutorial show how too extract spatial control variables based on surveys coordinate locations. The survey refers to Ethiopia 2019 and comes from the The World Bank Living Standards Measurement Study (LSMS). The spatial variables are nighttime light, agroecological zones, Urban-Rural Catchment Area, elevation, and climatic parameters. More information about these datasets are in the Appendix.
2 Code
2.1 Set Up
We start by setting up the stage for our analysis. First, we load the necessary packages. We load only climatic4economist package that contains several functions meant to extract and merge spatial variables with surveys. During the tutorial we will use other packages but instead of loading all the package at the begging we will call specific function each time.
In the setup, we also want to create the paths to the various data sources and load the necessary functions for extraction. Note .. means one step back to the folder directory, i.e. one folder back.
Note that how to set up the paths depends on your folder organization but there are overall two approaches:
- you can use the
R project, by opening the project directly you don’t need to set up the path to the project. Automatically the project figures out on its own where it is located in the computer and set that path as working folder. - you can manually set the working folder with the function
setwd().
# path to data folder
path_to_data <- file.path("..",
"..", "data")
# survey and administrative division
path_to_survey <- file.path(path_to_data, "survey", "LSMS", "LSMS_ETH19.dta")
path_to_adm_div <- file.path(path_to_data, "adm_div", "geoBoundaries")
# weather variables
path_to_pre <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
"afr_month_50_25_tpr.nc")
path_to_tmp <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
"afr_month_50_25_tmp.nc")
# control variables
path_to_elevation <- file.path(path_to_data, "spatial", "elevation", "GloFAS",
"elevation_glofas_v4_0.nc")
path_to_urca <- file.path(path_to_data, "spatial", "URCA",
"URCA.tif")
path_to_pop <- file.path(path_to_data, "spatial", "population", "WorldPop",
"uncontraint_1km_global", "ppp_2019_1km_Aggregated.tif")
path_to_nightlight <- file.path(path_to_data, "spatial", "nighttime_light",
"VIIRS", "VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif")
path_to_aez <- file.path(path_to_data, "spatial", "AgroEcological", "AEZ",
"GAEZv5", "GAEZ-V5.AEZ33-10km.tif")
# to result folder
path_to_result <- file.path(path_to_data, "result")- 1
- concatenate the string to make a path
- 2
-
..means one folder back
2.2 Read the Data
2.2.1 Survey Data
We start by reading the surveys data. The survey is stored as dta file, so we use the haven::read_dta() function to read it.
We only need the hhid, the survey coordinates, and the interview dates. We use dplyr::select() to choose these variables. This passage is optional and we bring with us all the variables, but we won’t use them.
Then we create/modify some variables with the function dplyr::mutate(). We transform the the variable interview_date from string into data, and we get the year of the median value of the date of interviews. This passage is important as it allows us to define the most appropriate year to select for the spatial variables.
srvy <- haven::read_dta(path_to_survey) |>
dplyr::select(survey_year, hhid, country, lat, lon, interview_date) |>
dplyr::mutate(
interview_date = clock::date_parse(interview_date,
format = "%Y-%m-%d"),
survey_year = clock::get_year(median(interview_date)),
.before = hhid)- 1
- read dta type data
- 2
- select relevant variables
- 3
- transform string into date type
- 4
- specify format type
- 5
- find the median year of the interviews
2.2.2 Spatial Data
Finally, we load the spatial data. This data typically comes in the form of raster data. A raster represents a two-dimensional image as a rectangular matrix or grid of pixels. These are spatial rasters because they are georeferenced, meaning each pixel (or “cell” in GIS terms) represents a square region of geographic space. The value of each cell reflects a measurable property (either qualitative or quantitative) of that region.
To spatial data is usually stored as tif file or nc. We can read both of them them with the function terra::rast().
When we print the raster, we obtain several key details. The dimension tells us how many cells the raster consists of and the number of layers, each layer corresponds to a particular months for which the observations were made. We also get the spatial resolution, which defines the size of each square region in geographic space, and the coordinate reference system (CRS), i.e. EPSG:4326.
When working with multiple spatial data, you must ensure that they have the same coordinate reference system (CRS). This is important because in this way all the data can “spatially” talk to each other.
nightlight <- terra::rast(path_to_nightlight) |>
setNames("nightlight")
nightlight
elevation <- terra::rast(path_to_elevation)
elevation
urca <- terra::rast(path_to_urca)
urca
aez <- terra::rast(path_to_aez) |>
setNames("aez")
aez- 1
- read raster type data
- 2
- change the name of the layer
class : SpatRaster
dimensions : 33601, 86401, 1 (nrow, ncol, nlyr)
resolution : 0.004166667, 0.004166667 (x, y)
extent : -180.0021, 180.0021, -65.00208, 75.00208 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif
name : nightlight
class : SpatRaster
dimensions : 3000, 7200, 1 (nrow, ncol, nlyr)
resolution : 0.05, 0.05 (x, y)
extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : elevation_glofas_v4_0.nc
varname : elevation (Height above sea level)
name : elevation
unit : m
class : SpatRaster
dimensions : 17235, 43200, 1 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -180, 180, -60, 83.625 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : URCA.tif
name : URCA
class : SpatRaster
dimensions : 2160, 4320, 1 (nrow, ncol, nlyr)
resolution : 0.08333333, 0.08333333 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : GAEZ-V5.AEZ33-10km.tif
name : aez
Now we also read the weather observation. The same consideration about the coordinate reference system (CRS) is still valid. When we work with raster that have also observations over time, it is important to check how and where the time and date information is stored. Sometimes it is stored in the metadata and you can access it using terra::time(), other time it is already saved as the name of the layer and you can access it using names(). Sometimes, like in this case the date information is stored in the names but the format is based on second passed from 1970-01-01 00:00. To transform this observation into readable date we can use the function second_to_date().
Note that rasters can store time information in different ways, so it may not always be possible to retrieve dates in this manner. A common alternative is for dates to be embedded in the layer names, in which case we wouldn’t need to rename the layers.
pre <- terra::rast(path_to_pre)
pre
names(pre) <- terra::names(pre) |> second_to_date()
pre
tmp <- terra::rast(path_to_tmp)
names(tmp) <- terra::names(tmp) |> second_to_date()- 1
- transform the layers name with second into dates
class : SpatRaster
dimensions : 741, 811, 904 (nrow, ncol, nlyr)
resolution : 0.1, 0.1 (x, y)
extent : -26.05, 55.05, -36.05, 38.05 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : afr_month_50_25_tpr.nc:tp
varname : tp (Total precipitation)
names : tp_va~52000, tp_va~73600, tp_va~54400, tp_va~76000, tp_va~84000, tp_va~05600, ...
unit : m, m, m, m, m, m, ...
class : SpatRaster
dimensions : 741, 811, 904 (nrow, ncol, nlyr)
resolution : 0.1, 0.1 (x, y)
extent : -26.05, 55.05, -36.05, 38.05 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : afr_month_50_25_tpr.nc:tp
varname : tp (Total precipitation)
names : 1950-01-01, 1950-02-01, 1950-03-01, 1950-04-01, 1950-05-01, 1950-06-01, ...
unit : m, m, m, m, m, m, ...
2.2.3 Administrative Boundaries
We now move to read the administrative divisions. We use the function read_geoBoundaries() to do so. This function looks for spatial polygons for the iso and lvl provided provided.
As we have the coordinates, we don’t actually need the administrative divisions for the extraction. However, we will use it to reduce the coverage of the spatial variables and to make some plots.
The same consideration about the coordinate reference system (CRS) is still valid.
adm_div <- read_geoBoundaries(path_to_adm_div, iso = "ETH", lvl = 2)
adm_div class : SpatVector
geometry : polygons
dimensions : 74, 4 (geometries, attributes)
extent : 33.00224, 47.95925, 3.400365, 14.84602 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : ID_adm_div iso adm_div_1 adm_div_2
type : <chr> <chr> <chr> <chr>
values : 1 ETH Somali Afder
2 ETH Gambela Agnuak
3 ETH SNNPR Alaba
2.3 Georeference the Surveys
As we’ve mentioned, the spatial data is georeferenced, so we need to ensure the same for the survey data. Since many households share the same coordinates, they are linked to the same location. To reduce computation time, we extract data only for the unique coordinates, rather than for each household. Moreover, we must ensure that we can later associate the correct weather data with the right household, we do this by creating an merging variable called ID.
This is handled by the prepare_coord() function, which requires the coordinates’ variable names as input.
Once we have the unique coordinates, we are ready to transform them into spatial points using the georef_coord() function. When performing this transformation, it’s crucial to set the correct CRS, which must match that of the weather data. The CRS is provided as an argument of the function, using the previously saved CRS from the weather data. Also the georef_coord() function requires the coordinates’ variable names as input. Usually, the WGS 84 CRS is the default coordinate references system for coordinates. In this case it matches the weather coordinate references system.
We can print the result to check the transformation. The new column, ID, is created by prepare_coord() and identifies each unique coordinate. This is used to merge the weather data with the household data.
srvy_coord <- prepare_coord(srvy,
lon_var = lon,
lat_var = lat)
srvy_coord# A tibble: 6,505 × 7
ID survey_year hhid country lat lon interview_date
<chr> <int> <chr> <chr> <dbl> <dbl> <date>
1 1 2019 051103088801903002 Ethiopia 3.61 39.0 2019-08-28
2 1 2019 051103088801903012 Ethiopia 3.61 39.0 2019-08-28
3 1 2019 051103088801903022 Ethiopia 3.61 39.0 2019-08-28
4 1 2019 051103088801903032 Ethiopia 3.61 39.0 2019-08-29
5 1 2019 051103088801903042 Ethiopia 3.61 39.0 2019-08-29
6 1 2019 051103088801903052 Ethiopia 3.61 39.0 2019-08-28
7 1 2019 051103088801903062 Ethiopia 3.61 39.0 2019-08-28
8 1 2019 051103088801903072 Ethiopia 3.61 39.0 2019-08-28
9 1 2019 051103088801903082 Ethiopia 3.61 39.0 2019-08-28
10 1 2019 051103088801903092 Ethiopia 3.61 39.0 2019-08-29
# ℹ 6,495 more rows
Once we have the unique coordinates, we are ready to transform them into spatial points using the georef_coord() function. When performing this transformation, it’s crucial to set the correct CRS, which must match that of the weather data. The function also the coordinates’ variable names as input.
srvy_geo <- georef_coord(srvy_coord,
geom = c("lon", "lat"),
crs = "EPSG:4326")
srvy_geo class : SpatVector
geometry : points
dimensions : 516, 1 (geometries, attributes)
extent : 33.43483, 47.30784, 3.609384, 14.47715 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : ID
type : <chr>
values : 1
2
3
Pay attention on the reduced number of observation between srvy_coord and srvy_geo. From 6505 rows to 516, these are the actual unique locations from the survey.
2.4 Crop the spatial variables
The spatial variables variables we have just load have a global coverage. It might be convenient to reduce the coverage to just the countries we are interested in. We can do this by using the crop_with_buffer() function and the administrative divisions. As the name suggest, this function allows to specify a buffer around the vector data to increase the spatial extent and crop a larger portion. This is useful as some survey coordinates are at the edge of the administrative borders or, in some rare cases, just outside the borders as consequence of the coordinates modification fro location anonymization. Further, to compute some spatial indicators in one cell we need the surrounding cell values and if we crop exactly at the borders those cell values at the edge won’t have the he surrounding cells.
The buffer argument of the function specifies the increase around the spatial extent. By default, it is in the same unit of measure of the data.
This is not a compulsory step but it reduce the memory burden and allows for more meaningful plotting.
nghtlght_cntry <- crop_with_buffer(nightlight, adm_div, buffer = 1)
elevatn_cntry <- crop_with_buffer(elevation, adm_div, buffer = 1)
urca_cntry <- crop_with_buffer(urca, adm_div, buffer = 1)
aez_cntry <- crop_with_buffer(aez, adm_div, buffer = 1)
pre_cntry <- crop_with_buffer(pre, adm_div, buffer = 1)
tmp_cntry <- crop_with_buffer(tmp, adm_div, buffer = 1)2.5 Plot
A good practice when working with spatial data is to plot it. This is the best way to verify that everything is working as expected.
First, we plot the survey coordinates to ensure they are correctly located within the country and to examine their spatial distribution.
terra::plot(adm_div, col = "grey", main = "District of Ethiopia and Survey Coordinates")
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)- 1
- plot raster
- 2
- add survey locations
We confirm that the survey locations are within the country borders, which is great! We also observe that the spatial distribution of survey coordinates is neither random nor uniform; most are concentrated near the major cities and in the North.
Next, we plot the spatial variables to see how it overlaps with the spatial coordinates.
terra::plot(elevatn_cntry, main = "Elevation")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)- 1
- plot raster
- 2
- add administrative borders
- 3
- add survey locations
terra::plot(urca_cntry, main = "URCA")
terra::lines(adm_div, col = "black", lwd = 2)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)terra::plot(log(1+nghtlght_cntry), main = "Log Nighttime Light")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)terra::plot(tmp_cntry, "2024-03-01", col = terra::map.pal("water"),
main = "Monthly precipitation at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)terra::plot(tmp_cntry, "2024-03-01", col = terra::map.pal("ryb"),
main = "Monthly temperature at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "black", alpha = 0.5, cex = 0.5)Once again, the survey coordinates align with the precipitation data, which is great! We can also observe the different spatial resolution, with precipitation having a lower one. The consequence is that some survey coordinates still fall within the same cell.
2.6 Modify the Spatial Variables
2.6.1 Compute Terrain Indicators
Now we compute some terrain indicators based on elevation. The terrain indicators are:
TRI (Terrain Ruggedness Index) is the mean of the absolute differences between the value of a cell and its 8 surrounding cells.
Slope is the average difference between the value of a cell and its 8 surrounding cells.
Roughness is the difference between the maximum and the minimum value of a cell and its 8 surrounding cells.
terrain_cntry <- terra::terrain(elevatn_cntry,
v = c("slope", "TRI", "roughness"),
neighbors = 8,
unit = "degrees")- 1
- the terrain indicators
- 2
- how many neighboring cells, 8 (queen case) or 4 (rook case)
2.6.2 Weather Variable Transformation
The original unit of measure of the weather data is in meter for precipitation and Kelvin for temperature. These unit of measure are not very intuitive, therefore we change them into millimeter and Celsius respectively.
# From meter to millimeters
pre_cntry_mm <- pre_cntry*1000
# From Kelvin to Celsius
tmp_cntry_c <- tmp_cntry - 273.152.7 Extraction
Next, we extract the spatial data based on the survey coordinates using the extract_by_coord() function. This function requires the raster with the spatial data and the georeferenced coordinates as inputs.
Looking at the result, we see first the ID column, that identifies the unique survey coordinates. The second and third column are the coordinates of the cells. The other columns contain the spatial observations, specific to each coordinate. For the weather data we have the time series of observations over time, specific to each coordinate.
nghtlght_coord <- extract_by_coord(nghtlght_cntry, srvy_geo)
nghtlght_coord# A tibble: 516 × 4
ID x_cell y_cell nightlight
<chr> <dbl> <dbl> <dbl>
1 1 39.0 3.61 0
2 2 41.8 4.01 0
3 3 41.9 4.44 0
4 4 41.5 4.73 0
5 5 36.0 4.75 0
6 6 38.1 4.85 0
7 7 37.4 4.97 0
8 8 40.7 5.11 0.816
9 9 41.9 5.15 0
10 10 44.6 5.24 0
# ℹ 506 more rows
elevation_coord <- extract_by_coord(elevatn_cntry, srvy_geo)
terrain_coord <- extract_by_coord(terrain_cntry, srvy_geo)
urca_coord <- extract_by_coord(urca_cntry, srvy_geo)
aez_coord <- extract_by_coord(aez_cntry, srvy_geo)
pre_coord <- extract_by_coord(pre_cntry, srvy_geo)
tmp_coord <- extract_by_coord(tmp_cntry, srvy_geo)
tmp_coord# A tibble: 516 × 907
ID x_cell y_cell X1950_01_01 X1950_02_01 X1950_03_01 X1950_04_01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 39 3.60 296. 298. 295. 294.
2 2 41.8 4.00 303. 304. 300. 299.
3 3 41.9 4.40 303. 304. 300. 299.
4 4 41.5 4.7 301. 302. 299. 298.
5 5 36 4.7 301. 304. 300. 299.
6 6 38.1 4.8 293. 295. 292. 292.
7 7 37.4 5.00 294. 296. 293. 292.
8 8 40.7 5.1 296. 298. 295. 294.
9 9 41.9 5.1 303. 303. 300. 298.
10 10 44.6 5.2 301. 302. 301. 300.
# ℹ 506 more rows
# ℹ 900 more variables: X1950_05_01 <dbl>, X1950_06_01 <dbl>,
# X1950_07_01 <dbl>, X1950_08_01 <dbl>, X1950_09_01 <dbl>, X1950_10_01 <dbl>,
# X1950_11_01 <dbl>, X1950_12_01 <dbl>, X1951_01_01 <dbl>, X1951_02_01 <dbl>,
# X1951_03_01 <dbl>, X1951_04_01 <dbl>, X1951_05_01 <dbl>, X1951_06_01 <dbl>,
# X1951_07_01 <dbl>, X1951_08_01 <dbl>, X1951_09_01 <dbl>, X1951_10_01 <dbl>,
# X1951_11_01 <dbl>, X1951_12_01 <dbl>, X1952_01_01 <dbl>, …
Again we have a row for each unique location from the survey. However, if we want to know how many different cells there are we can look unique cell coordinates.
unique_cell <- tmp_coord |>
dplyr::distinct(x_cell, y_cell)
nrow(unique_cell)- 1
- identifies the unique combination of the variables
[1] 365
We see that now the number of rows is 365, this is the actual different weather observations that we can merge with the survey. We start with 6505 different household, then we have 516 different survey coordinates, and we end up with 365 different weather observations.
2.8 Cmpute Long Run Climatic Parameter
We want to describe the long run climatic condition in each locations. Rule of thumb is to use 30 years of weather observations to capture climatic features. Therefore, we select the 30 years before each survey.
Check the names with the date of observations and how it has changed since before.
pre_coord_30yrs <- select_by_dates(pre_coord, from = "1989", to = "2019")
tmp_coord_30yrs <- select_by_dates(tmp_coord, from = "1989", to = "2019")
tmp_coord_30yrs# A tibble: 516 × 364
ID x_cell y_cell X1989_01_01 X1989_02_01 X1989_03_01 X1989_04_01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 39 3.60 297. 297. 297. 294.
2 2 41.8 4.00 303. 304. 303. 300.
3 3 41.9 4.40 303. 304. 303. 300.
4 4 41.5 4.7 301. 301. 301. 299.
5 5 36 4.7 303. 303. 304. 302.
6 6 38.1 4.8 294. 295. 294. 292.
7 7 37.4 5.00 296. 296. 296. 294.
8 8 40.7 5.1 297. 297. 297. 295.
9 9 41.9 5.1 303. 303. 303. 300.
10 10 44.6 5.2 301. 301. 303. 302.
# ℹ 506 more rows
# ℹ 357 more variables: X1989_05_01 <dbl>, X1989_06_01 <dbl>,
# X1989_07_01 <dbl>, X1989_08_01 <dbl>, X1989_09_01 <dbl>, X1989_10_01 <dbl>,
# X1989_11_01 <dbl>, X1989_12_01 <dbl>, X1990_01_01 <dbl>, X1990_02_01 <dbl>,
# X1990_03_01 <dbl>, X1990_04_01 <dbl>, X1990_05_01 <dbl>, X1990_06_01 <dbl>,
# X1990_07_01 <dbl>, X1990_08_01 <dbl>, X1990_09_01 <dbl>, X1990_10_01 <dbl>,
# X1990_11_01 <dbl>, X1990_12_01 <dbl>, X1991_01_01 <dbl>, …
Now we can compute the long run climatic parameter. We calculate the mean, the standard deviation, and the coefficient of variation. We collect all the parameter in a separate object parameter. This object is a names list of functions and we construct it with this structure name = function, then the list() function puts them together. This passage is not compulsory but allows to perform the computation of multiple parameters in a tidy and efficient way. Otherwise we could have directly add them inside the calc_par().
The function calc_par() calculates the required parameters.
The results have a similar structure, with the first columns that identify the specific locations and the other the computed parameters. Note how we are still carrying on the coverage_fraction variable as we will need it for aggregating the climatic parameter at the administrative division.
parameter <- list(std = sd, avg = mean, coef_var = cv)
parameter$std
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x11edeff30>
<environment: namespace:stats>
$avg
function (x, ...)
UseMethod("mean")
<bytecode: 0x10c127cf8>
<environment: namespace:base>
$coef_var
function(x, na_rm = TRUE) {
avg <- mean(x, na.rm = na_rm)
if (is.nan(avg)) return(NA_real_)
if (avg == 0) return(NA_real_) # Avoid division by zero
sd(x, na.rm = na_rm) / mean(x, na.rm = na_rm)
}
<bytecode: 0x11ede4320>
<environment: namespace:climatic4economist>
pre_par_coord <- calc_par(pre_coord_30yrs, pars = parameter, prefix = "pre")
tmp_par_coord <- calc_par(tmp_coord_30yrs, pars = parameter, prefix = "tmp")
tmp_par_coord# A tibble: 516 × 6
ID x_cell y_cell tmp_std tmp_avg tmp_coef_var
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 39 3.60 1.71 295. 0.00580
2 10 44.6 5.2 1.05 302. 0.00347
3 100 37.9 7.5 1.35 290. 0.00464
4 101 45 7.5 1.17 298. 0.00392
5 102 36.8 7.6 1.31 291. 0.00449
6 103 34.3 7.7 2.45 302. 0.00811
7 104 38.3 7.7 1.41 292. 0.00481
8 105 34.1 7.7 2.45 302. 0.00812
9 106 37.4 7.7 1.79 293. 0.00613
10 107 37.5 7.8 1.55 291. 0.00534
# ℹ 506 more rows
2.9 Merge with Survey
Now that we have everything, we can combine all the extracted data and then merge them with the survey. We start by combining the data into a unique data set. To do so we start by create a list with the function list(), each element of the list is a different spatial variable.
We need to drop the cell coordinates. This because, the cell resolution is different among the spatial datasets, despite the same CRS, and thus the coordinates of the cells are slightly different among datasets. This impinges the merge and at this stage of the analysis we don’t need the information they are carrying anymore.
To drop them requires a convoluted approach, but it takes advantage that all the datasets are grouped in the same list and the procedure is the same for each dataset. To apply a function to each element of a list we can use the function purrr::map(). As arguments, this function requires the function we want to apply, namely dplyr::select, and additional arguments for the function, -c(x_cell, y_cell) which are the columns we want to to drop. Note the minus symbol as it tells the function we want to drop the columns and not keep them.
Then we combine the elements of the list with the function purrr::reduce(). This last function require another function as input to drive the combination and we choose to use merge_by_common(), which merges two data by their common variable names.
Why not using directly merge_by_common()? Because the function works with just two datasets and we have eight different spatial datasets. We can cumulatively merge the datasets one by one or we can use the purrr::reduce().
Then we compute the logarithmic transformation of the nighttime light, with the dplyr::mutate() function. We use the argument .after to specify where the position of the variable.
sptl_coord <- list(nghtlght_coord,
terrain_coord,
elevation_coord,
urca_coord,
aez_coord,
pre_par_coord,
tmp_par_coord) |>
purrr::map(dplyr::select, -c(x_cell, y_cell)) |>
purrr::reduce(merge_by_common) |>
dplyr::mutate(ln_nightlight = log(1+nightlight), .after = nightlight)
sptl_coord- 1
- combine the data into a list
- 2
- remove variables from each element of the list
- 3
- merge all the elements of the list
- 4
- create new variable
# A tibble: 516 × 15
ID nightlight ln_nightlight slope TRI roughness elevation URCA aez
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 0 0 0.394 37.7 138. 1140. 13 1
2 2 0 0 0.274 23.7 67.0 249. 19 26
3 3 0 0 0.0540 16.3 38.1 196. 12 26
4 4 0 0 0.961 73.5 215. 466. 30 26
5 5 0 0 0.0237 1.94 7.63 369. 13 29
6 6 0 0 2.10 195. 619. 1838. 12 26
7 7 0 0 0.939 81.4 252. 1514. 19 1
8 8 0.816 0.597 0.625 42.6 155. 1215. 5 1
9 9 0 0 0.264 18.2 60.9 281. 12 26
10 10 0 0 0.512 38.3 149. 241. 20 29
# ℹ 506 more rows
# ℹ 6 more variables: pre_std <dbl>, pre_avg <dbl>, pre_coef_var <dbl>,
# tmp_std <dbl>, tmp_avg <dbl>, tmp_coef_var <dbl>
Now that we have all the control variables together, we can merge them with the surveys information. The function merge_by_common() will do it for us.
We can see that the result has all the information we retained from the surveys and the new extracted spatial variables.
srvy_sptl_coord <- merge_by_common(srvy_coord, sptl_coord)
srvy_sptl_coord# A tibble: 6,505 × 21
ID survey_year hhid country lat lon interview_date nightlight
<chr> <int> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
2 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
3 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
4 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-29 0
5 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-29 0
6 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
7 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
8 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
9 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-28 0
10 1 2019 051103088801… Ethiop… 3.61 39.0 2019-08-29 0
# ℹ 6,495 more rows
# ℹ 13 more variables: ln_nightlight <dbl>, slope <dbl>, TRI <dbl>,
# roughness <dbl>, elevation <dbl>, URCA <int>, aez <int>, pre_std <dbl>,
# pre_avg <dbl>, pre_coef_var <dbl>, tmp_std <dbl>, tmp_avg <dbl>,
# tmp_coef_var <dbl>
2.10 Write
Here we are at the end, let’s save the results. We want to save the result as dta so we will use the haven::write_dta() function.
haven::write_dta(srvy_sptl_coord,
file.path(path_to_result, "ETH_sp_coord.dta"))3 Take home messages
When working with multiple spatial data:
- remember to control the Coordinate Reference System of all dataset
- plot the data to check everything is going well
based on the typology of data we use different function to read them
- for
dtausehaven::read_dta() orandhaven::_write_dta() - for spatial vectors
read_GAUL()for administrative divisions in specific country and level, otherwiseterra::vect() - for spatial raster use
terra::rast()
- for
Extraction
- since many household share the same locations, we take advantage of this by extracting the weather data only at the unique locations, we achieve this with the function
prepare_coord(). - same of these unique locations may fall within the same value cells, so the actual information might be even lower.
- since many household share the same locations, we take advantage of this by extracting the weather data only at the unique locations, we achieve this with the function
When working with raster data
- check the unit of measure
- if it is a time series check also the date format
When working with survey and time series data, remember to select the appropriate observations, for example not those that happened after the interview.
4 Appendix
4.1 Want to know about the data?
4.1.1 Weather
Weather observation are obtained from ERA5-Land reanalysis dataset. H-TESSEL is the land surface model that is the basis of ERA5-Land. The data is a post-processed monthly-mean average of the original ERA5-Land dataset.
| Parameter | Value |
|---|---|
| spatial resolution | 0.1° x 0.1° lon lat |
| temporal resolution | month |
| time frame | Jan. 1950 - Dec. 2022 |
| unit of measure | meter or Kelvin |
Suggested citation:
- Muñoz Sabater, J. (2019): ERA5-Land monthly averaged data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.68d2bb30
It is possible to find additional information:
The data can be freely download from
- here.
4.1.1.0.1 Total precipitation
Accumulated liquid and frozen water, including rain and snow, that falls to the Earth’s surface. It is the sum of large-scale precipitation and convective precipitation. Precipitation variables do not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth.
4.1.1.0.2 2 metre above ground temperature
Temperature of air at 2m above the surface of land, sea or in-land waters. 2m temperature is calculated by interpolating between the lowest model level and the Earth’s surface, taking account of the atmospheric conditions.
4.1.2 Spatial variables
4.1.2.1 Agro Ecological Zones
The Agro-ecological Zones classification (33 classes) provides a characterization of bio-physical resources relevant to agricultural production systems. AEZ definitions and map classes follow a rigorous methodology and an explicit set of principles. The inventory combines spatial layers of thermal and moisture regimes with broad categories of soil/terrain qualities. It also indicates locations of areas with irrigated soils and shows land with severely limiting bio-physical constraints including very cold and very dry (desert) areas as well as areas with very steep terrain or very poor soil/terrain conditions. The AEZ classification dataset is part of the GAEZ v5 Land and Water Resources theme and Agro-ecological Zones sub-theme. All results are derived from the Agro-ecological Zones (AEZ) modeling framework, developed collaboratively by the Food and Agriculture Organization (FAO) and the International Institute for Applied Systems Analysis (IIASA).
| Parameter | Value |
|---|---|
| spatial resolution | 10 km. |
| temporal resolution | 20 years |
| time frame | 2001–2020 |
| unit of measure | classification by climate/soil/terrain/LC (33 classes) |
Suggested citation:
- FAO & IIASA. 2025. Global Agro-ecological Zoning version 5 (GAEZ v5) Model Documentation. https://github.com/un-fao/gaezv5/wiki
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.2 Urban-Rural Catchment Area (URCA)
Urban–rural catchment areas showing the catchment areas around cities and towns of different sizes (the no data value is 128). Each rural pixel is assigned to one defined travel time category to one of seven urban agglomeration sizes.
| Parameter | Value |
|---|---|
| spatial resolution | 0.03° x 0.03° lon lat |
| temporal resolution | year |
| time frame | 2015 |
| unit of measure | travel time category to different urban hierarchy |
Suggested citation:
- Cattaneo, Andrea; Nelson, Andy; McMenomy, Theresa (2020). Urban-rural continuum. figshare. Dataset. https://doi.org/10.6084/m9.figshare.12579572.v4
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.3 Population
The units are number of people per pixel. The mapping approach is Random Forest-based dasymetric redistribution.
| Parameter | Value |
|---|---|
| spatial resolution | 30 arc second (~1km) |
| temporal resolution | year |
| time frame | 2010 - 2020 |
| unit of measure | estimated count of people per grid-cell |
Suggested citation:
- WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00647
It is possible to find additional information from:
The data can be freely download from:
- here.
4.1.2.4 Nighttime light
VIIRS nighttime lights (VNL) version V2.1: annual values obtained by from the monthly averages with filtering to remove extraneous features such as biomass burning, aurora, and background.
| Parameter | Value |
|---|---|
| spatial resolution | 15 arc second |
| temporal resolution | year |
| time frame | 2012 - 2021 |
| unit of measure | nW/cm2/sr, average-masked |
Suggested citation:
- Elvidge, C.D, Zhizhin, M., Ghosh T., Hsu FC, Taneja J. Annual time series of global VIIRS nighttime lights derived from monthly averages:2012 to 2019. Remote Sensing 2021, 13(5), p.922, doi:10.3390/rs13050922
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.5 Elevation
The Global Flood Awareness System (GloFAS) is one component of the Copernicus Emergency Management Service (CEMS). It is designed to support preparatory measures for flood events worldwide, particularly in large transnational river basins.
Elevation is obtained from the auxiliary variables of GloFAS. Each pixel is the mean height elevation above sea level.
| Parameter | Value |
|---|---|
| spatial resolution | 0.03° x 0.03° lon lat |
| temporal resolution | 30 years |
| time frame | 1981 - 2010 |
| unit of measure | Meter (m) |
Web resources:
Data access:
4.1.3 Survey
The Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) is a unique system of longitudinal surveys designed to improve the understanding of household and individual welfare, livelihoods and smallholder agriculture in Africa. The LSMS team works with national statistics offices to design and implement household surveys with a strong focus on agriculture.
Suggested citation:
- Central Statistics Agency of Ethiopia. (2020). Socioeconomic Survey 2018-2019 [Data set]. World Bank, Development Data Group. https://doi.org/10.48529/K739-C548
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.4 Administrative boundaries
The administrative divisions are obtained from GeoBoundaries[^2]. GeoBoundaries Built by the community and William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries Database is an online, open license (CC BY 4.0) resource of information on administrative boundaries (i.e., state, county) for every country in the world. Since 2016, we have tracked approximately 1 million boundaries within over 200 entities, including all UN member states.
Suggested citation:
- Runfola D, Anderson A, Baier H, Crittenden M, Dowker E, Fuhrig S, et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866.
It is possible to find additional information:
- here.
The data can be freely download from:
- here.